Abstract

A recursive estimator of the conditional geometric median in Hilbert spaces is studied.
It is based on a stochastic gradient algorithm whose aim is to minimize a weighted
$L^1$ criterion and is consequently well adapted for robust online estimation. The weights
are controlled by a kernel function and an associated bandwidth. Almost sure convergence
and $L^2$ rates of convergence are proved under general conditions on the conditional
distribution as well as the sequence of descent steps of the algorithm and the
sequence of bandwidths. Asymptotic normality is also proved for the averaged version
of the algorithm with an optimal rate of convergence. A simulation study confirms the
interest of this new and fast algorithm when the sample sizes are large. Finally, the ability
of these recursive algorithms to deal with very high-dimensional data is illustrated
on the robust estimation of television audience profiles conditional on the total time
spent watching television over a period of 24 hours.

Keywords: asymptotic normality, averaging, CLT, kernel regression, Mallows-Wasserstein
distance, online data, Robbins-Monro, robust estimator, sequential estimation, stochastic
gradient.



1 Introduction 



It is not unusual nowadays to get large samples of high-dimensional or functional data
together with real covariates that are correlated with the functional variable under study. The
estimation of how the shape of the functional response may depend on real or functional
covariates has been deeply studied in the statistical literature: linear models for functional
response have been proposed by Faraway (1997), Cuevas et al. (2002) or Bosq (2000) (see
also Ramsay and Silverman (2005) and Greven et al. (2010)), whereas nonlinear relationships
are studied in Lecoutre (1990), Chiou et al. (2004), Lian (2007), Cardot (2007), Lian (2011)
and Ferraty et al. (2011).

The main drawback of all the above mentioned estimators, whose target is the condi- 
tional expectation, is that they all rely, explicitly or not, on least squares and are conse- 
quently sensitive to outliers. In such a context of large samples of high dimensional data, 
outlying observations, which may not be uncommon, might be hard to detect with automatic
procedures. Directly considering robust indicators of centrality such as medians is
a way to deal with this issue. If $Y$ is a random variable taking values in a Hilbert space
$H$, its geometric median $m$ (also called spatial median or $L^1$-median, see Small (1990) for a
survey) is defined as follows:

$$ m := \arg\min_{\alpha \in H} \mathbb{E}\left[ \|Y - \alpha\| - \|Y\| \right]. \tag{1} $$

The median $m$ is uniquely defined under simple conditions when the dimension of $H$ is
larger than or equal to 2. It has a 0.5 breakdown point (Kemperman (1987)) as well as a
bounded gross sensitivity error (Cardot et al. (2011)). When one has a sample at hand,
algorithms based on the minimization of the empirical version of risk (1) have been proposed by
Vardi and Zhang (2000), and properties of such robust estimators can be found in the recent
review by Möttönen et al. (2010). Nevertheless, these computational techniques may not
be able to handle very large samples of high-dimensional data since they require storing
all the data. An alternative approach, developed by Chaouch and Goga (2012) and which
can cope with this issue, consists in considering unequal probability sampling techniques
in order to select, in an effective way, subsamples with sizes much smaller than the initial
sample size.

We suggest in this work another direction based on recursive techniques which do not
require storing all the data. Another interest of these recursive approaches is that they allow
automatic update of the estimators if, for example, the data arrive sequentially. Recently, a
simple recursive algorithm which gives efficient estimates of the geometric median in
separable Hilbert spaces has been proposed by Cardot et al. (2011). It is shown that averaged
versions of classic stochastic gradient algorithms have a limiting normal distribution that
is the same as the distribution of the static estimator based on a direct minimization of the
empirical version of risk (1).

In a finite dimension context, Cadre and Gannoun (2000) and Cheng and De Gooijer
(2007) proposed to introduce a kernel function $K$ in the empirical version of (1) in order to
take covariate effects into account. The kernel weights are controlled by a sequence of
bandwidth values that tends to zero when the sample size increases in order to build consistent
estimates of the conditional geometric median. With the same ideas of local approximation
of the conditional distribution, we study, in this work, a modification of the recursive
algorithm suggested in Cardot et al. (2011). It consists in introducing weights, controlled by
a kernel function, in order to build consistent recursive estimators of the conditional
geometric median. The response variable is also allowed to take values in a separable Hilbert
space. For real response, recursive estimators of the regression function based on kernel
weights have been introduced by Révész (1977), whereas a deep study of their asymptotic
properties, which also includes averaged estimation procedures, is proposed in Mokkadem
et al. (2009).

The paper is organized as follows. In Section 2, we first define the stochastic gradient 
recursive estimator as well as its averaged version for the case of a real covariate. Note 
that our results could be extended to multidimensional covariates. We state the asymptotic 
normality, under general conditions, of the averaged algorithm in separable Hilbert spaces, 
with an optimal rate of convergence. The regularity hypotheses, which are much weaker 
than those of Cadre and Gannoun (2000), are also expressed in terms of the Wasserstein
distance between the conditional distributions. 

In Section 3, a comparison of the static approach, which consists in minimizing the
empirical version of risk (1), with the stochastic gradient estimator and its averaged version
is performed on a simulation study. It confirms the good behavior as well as the stability, 
with respect to the descent steps, of the averaged algorithm. The ability of this estimator to 
deal with large samples of very high-dimensional data is then illustrated on the estimation 
of television audience profiles given the total time spent watching television. Proofs are 
gathered in Section 4. 
2 Notations, hypotheses and main results

Let $(Y, X)$ be a pair of random variables taking values in $H \times \mathbb{R}$, where $H$ is a
Hilbert space whose norm is denoted by $\|\cdot\|$. Suppose that $X$ is continuous, and denote by
$p(x)$ its density at $x \in \mathbb{R}$. For any $x$ in the support of $X$, denote by $\mu_x$ the
conditional law of $Y$ given $X = x$. Consider, for $(\alpha, x) \in H \times \mathbb{R}$, the following
functional:

$$ G(\alpha, x) := p(x)\, \mathbb{E}\left[ \|Y - \alpha\| - \|Y\| \,\middle|\, X = x \right]. \tag{2} $$

The geometric median of $Y$ given $X = x$, denoted by $m(x)$, is defined as the solution of
the following optimization problem:

$$ m(x) := \arg\min_{\alpha \in H} G(\alpha, x). \tag{3} $$

The solution of (3) is unique provided that the conditional distribution $\mu_x$ is not supported
by a straight line (Kemperman (1987)). We suppose from now on the following assumption.

A1. For every $x$ in the support of the probability density function $p$ of the random variable
$X$, $\mu_x$ is not concentrated on a straight line: for all $v \in H$, there is $w \in H$ such that
$\langle v, w \rangle = 0$ and

$$ \mathrm{Var}\left( \langle w, Y \rangle \mid X = x \right) > 0. \tag{4} $$

Suppose we have a sequence $(X_n, Y_n)_{n \geq 1}$ of independent copies of $(X, Y)$. In the
unconditional case where the $X$ variable is not taken into account, one can look for the
unconditional median, i.e. the minimizer $m$ defined by (1). Under weak hypotheses, the median is
uniquely defined as the zero of the derivative:

$$ -\mathbb{E}\left[ \frac{Y - m}{\|Y - m\|} \right] = 0. $$

We introduced in Cardot et al. (2011) the following recursive estimator of $m$:

$$ Z_{n+1} = Z_n + \gamma_n \frac{Y_{n+1} - Z_n}{\|Y_{n+1} - Z_n\|}, \tag{5} $$

where $\gamma_n$ was a well-chosen deterministic sequence. In the present case, the law of $Y_n$ is
not the conditional law $\mu_x$, so this idea does not work directly. However, it is natural to
see $Y_n$ as an approximate sample of $\mu_x$ if $X_n$ happens to be very close to $x$. Therefore, a
simple estimator can be built by introducing weights, through a kernel function $K$, whose
properties will be specified later. We modify (5) as follows to take the weights into account,
and define our recursive estimator of $m(x)$:



$$ Z_{n+1}(x) = Z_n(x) + \gamma_n \frac{Y_{n+1} - Z_n(x)}{\|Y_{n+1} - Z_n(x)\|}\, \frac{1}{h_n} K\left( \frac{X_{n+1} - x}{h_n} \right), \tag{6} $$

with two deterministic sequences of tuning parameters $h_n$ and $\gamma_n$ whose properties are
given below.
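
As an illustration, the kernel-weighted recursion can be sketched in a few lines of Python. This is a minimal sketch and not the authors' code (which is written in R): the function names, the Epanechnikov kernel, the default exponents and the choice of the first observation as starting point are all assumptions made for the example. Each new observation moves the current estimate along the unit vector pointing towards it, with a step scaled by the kernel weight of its covariate.

```python
import numpy as np


def epanechnikov(u):
    """Bounded kernel with compact support [-1, 1] integrating to 1 (assumed choice)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def recursive_conditional_median(Y, X, x, c_gamma=1.0, gamma=0.9, c_h=1.0, h=0.3,
                                 kernel=epanechnikov, z0=None):
    """Single pass of the kernel-weighted stochastic gradient recursion.

    Y : (n, d) array of responses, X : (n,) array of real covariates,
    x : point at which the conditional median is estimated.
    Steps are gamma_n = c_gamma * n**(-gamma), bandwidths h_n = c_h * n**(-h).
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    # Hypothetical initialisation: start from the first observation unless z0 is given.
    z = Y[0].copy() if z0 is None else np.asarray(z0, dtype=float).copy()
    for n in range(1, len(Y)):
        gamma_n = c_gamma * n ** (-gamma)
        h_n = c_h * n ** (-h)
        weight = kernel((X[n] - x) / h_n) / h_n
        if weight == 0.0:
            continue  # covariate outside the kernel window: the estimate does not move
        diff = Y[n] - z
        norm = np.linalg.norm(diff)
        if norm > 0.0:
            z = z + gamma_n * weight * diff / norm
    return z
```

Only the current estimate is kept in memory, which is what makes the procedure suitable for sequentially arriving data.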

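For a constant sequence $(h_n)$, this algorithm converges towards the minimum of the
modified objective function:

$$ G_h(\alpha, x) := \mathbb{E}\left[ \left( \|Y - \alpha\| - \|Y\| \right) \frac{1}{h} K\left( \frac{X - x}{h} \right) \right]. \tag{7} $$
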
The partial derivative of $G_h$ with respect to $\alpha$ is an element of $H$ defined by

$$ \Phi_h(\alpha) := \nabla_\alpha G_h(\alpha, x) = -\mathbb{E}\left[ \frac{Y - \alpha}{\|Y - \alpha\|}\, \frac{1}{h} K\left( \frac{X - x}{h} \right) \right]. \tag{8} $$




We will see in Proposition 4.1 that, under suitable hypotheses, when $h$ goes to zero, $\Phi_h$
goes to the gradient $\Phi$ of $G$, defined by:

$$ \Phi(x, \alpha) = -p(x)\, \mathbb{E}\left[ \frac{Y - \alpha}{\|Y - \alpha\|} \,\middle|\, X = x \right]. \tag{9} $$




The idea of using a kernel, and of assigning a large weight to $Y_n$ when $X_n$ is close to
$x$, can only work if the conditional law $\mu_x$ varies, in some sense, regularly. A natural way
of expressing this regularity is through the Mallows-Wasserstein distance. Let us recall its
definition.

Definition 1. Let $\mu$ and $\nu$ be two probability measures on $H$ with finite second order moments.
Let $\mathcal{C}$ be the set of couplings of $\mu$ and $\nu$, i.e. the set of measures $\pi$ on $H \times H$
whose first marginal is $\mu$ and whose second marginal is $\nu$.

The Wasserstein distance between $\mu$ and $\nu$ is given by:

$$ W_2(\mu, \nu) = \left( \inf_{\pi \in \mathcal{C}} \int \|x - y\|^2 \, d\pi(x, y) \right)^{1/2}. $$
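
To make the definition concrete, here is a small sketch for the simplest Hilbert space, the real line. It relies on the standard fact that, for two empirical measures on $\mathbb{R}$ with the same number of equally weighted atoms, an optimal coupling pairs the order statistics. The function name is hypothetical and the restriction to equal sample sizes is an assumption made to keep the example short.

```python
import numpy as np


def empirical_w2_1d(sample_a, sample_b):
    """Wasserstein distance W_2 between two equal-size empirical measures on the real line.

    In dimension one the monotone (sorted) coupling is optimal, so the infimum
    over couplings reduces to pairing the sorted samples.
    """
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("both samples must have the same size")
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

For instance, shifting a sample by a constant $c$ gives a distance equal to $|c|$, whatever the shape of the sample.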
We may now state our assumptions. 



A2. The probability density function $p$ of the random variable $X$ is bounded and satisfies
a uniform Hölder condition: there are two constants $\beta > 0$ and $C_2 > 0$ such that

$$ \forall (x, x') \in \mathbb{R}^2, \quad |p(x) - p(x')| \leq C_2 |x - x'|^{\beta}. $$

We denote by $p_{\max} = \sup_{x \in \mathbb{R}} p(x)$.

A3. The gradient $\Phi(x, \alpha)$ defined by (9) satisfies a uniform Hölder condition with
coefficient $\beta$. There is $C_3 > 0$ such that

$$ \forall (x, x') \in \mathbb{R}^2, \ \forall \alpha \in H, \quad \|\Phi(x, \alpha) - \Phi(x', \alpha)\| \leq C_3 |x - x'|^{\beta}. \tag{10} $$

A4. The conditional law $\mu_x = \mathcal{L}(Y \mid X = x)$ varies regularly with $x$: there are two constants
$C_4$ and $\beta$ such that

$$ W_2(\mu_x, \mu_{x'}) \leq C_4 |x - x'|^{\beta}. \tag{11} $$

A5. The kernel function $K$ is positive, bounded with compact support and satisfies

$$ \int_{\mathbb{R}} K(u)\, du = 1. $$

A6. There is a constant $C_6$ such that:

$$ \forall \alpha \in H, \ \forall x, \quad \mathbb{E}\left[ \|Y - \alpha\|^{-2} \,\middle|\, X = x \right] \leq C_6. \tag{12} $$

Remark 1. Without loss of generality, we suppose that the constant $\beta$ in A2, A3 and A4 has always
the same value.

Assumption A3 is a regularity assumption that is required to control the approximation error
and to prove the convergence of the algorithm. Assumption A4 seems to be more natural, and we
prove in Section 4.1 that, together with A6, it implies A3.

Hypotheses A2 and A5 are classical in nonparametric estimation and could be weakened at the
expense of more complicated proofs. For classical properties of kernel estimators under general
hypotheses, see for example Wand and Jones (1995).

Similarly, Hypothesis A6 is stated quite strongly here, in order to avoid additional technicalities
in the proof of the asymptotic normality of the averaged algorithm. See Cardot et al. (2011) for a
relaxed version, under which the same results should hold. Informally it forces the law to be "spread
out" and this avoids pathological behaviors of the algorithm.

We have three main results. The first one states the almost sure convergence of the 
algorithm. 

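Theorem 2.1. Under assumptions A1-A3 and A5, and if $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 h_n^{-1} < \infty$ as well as
$\sum_n \gamma_n h_n^{\beta} < \infty$, then, for all $x$ such that $p(x) > 0$,

$$ \lim_{n \to \infty} \|Z_n(x) - m(x)\| = 0 \quad \text{a.s.} $$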



Remark 2. In the following, for simplicity, we choose the step size and window size as inverse
powers of $n$:

$$ \gamma_n = c_\gamma n^{-\gamma}, \qquad h_n = c_h n^{-h}. \tag{13} $$

With these choices the assumptions on the step sizes are:

$$ \gamma \leq 1, \qquad 2\gamma - h > 1, \qquad \gamma + \beta h > 1. \tag{14} $$

The assumptions on $h$ and $\gamma$ are always satisfied if we choose $\gamma = 1$ and $h < 1$.
However, as shown in the simulation study, the performances of algorithm (6) strongly depend
on the choice of the steps $\gamma_n$ and particularly on the constant $c_\gamma$. Therefore, we also
introduce the following averaged algorithm, which is less sensitive to the choice of the step sizes
$\gamma_n$ and has nice convergence properties:

$$ \bar{Z}_{n+1}(x) = \frac{1}{n} \sum_{k=1}^{n} Z_k(x). \tag{15} $$
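
The averaging step does not require storing the past iterates: the running mean can be updated online. The following sketch (with a hypothetical function name) takes any sequence of iterates and returns the successive running means; the list of means is kept here only so that the whole trajectory can be inspected, and a streaming implementation would keep the last value alone.

```python
import numpy as np


def running_average(iterates):
    """Running means of a sequence of iterates Z_1, Z_2, ...

    The k-th output is the mean of the first k iterates, computed with the
    online update bar_k = bar_{k-1} + (Z_k - bar_{k-1}) / k.
    """
    bar = None
    means = []
    for k, z in enumerate(iterates, start=1):
        z = np.asarray(z, dtype=float)
        bar = z.copy() if bar is None else bar + (z - bar) / k
        means.append(bar.copy())
    return np.array(means)
```

As noted in the simulation study, one may also start averaging only after a certain number of iterations; this amounts to feeding the function the iterates from that point on.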



Our main result is a central limit theorem on this averaged algorithm. To adapt the proof
of the corresponding CLT from Cardot et al. (2011), we need a good a priori bound on the
error $Z_n(x) - m(x)$.

Proposition 2.2. Suppose that $x$ is such that $p(x) > 0$ and that $\gamma \leq 1$, $2\gamma - h > 1$, $\gamma + \beta h > 1$,
and $h(1 + 2\beta) \geq \gamma$. Under Assumptions A1-A3 and A5, there exist an increasing sequence of
events $(\Omega_N)_{N \in \mathbb{N}}$ and constants $C_N$ such that $\Omega = \bigcup_{N \in \mathbb{N}} \Omega_N$ and

$$ \forall N, \quad \mathbb{E}\left[ \mathbf{1}_{\Omega_N} \|Z_n - m(x)\|^2 \right] \leq C_N \frac{\ln(n)}{n^{\gamma - h}}. $$
This proposition tells us that, up to a logarithmic factor, the optimal rates of convergence
in nonparametric estimation can be attained for well chosen values of the parameters $\gamma$ and
$h$. If $\gamma = 1$ and $h = (1 + 2\beta)^{-1}$, then

$$ \|Z_n - m(x)\|^2 = O_p\left( \ln(n)\, n^{-2\beta/(2\beta+1)} \right). \tag{16} $$

Finally our main result is the following central limit theorem for the averaged algorithm.

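Theorem 2.3. Assume A1, A2 and A4-A6. Let $x$ satisfy $p(x) > 0$. If $\gamma < 1$, $2\gamma - h > 1$,
$\gamma + \beta h > 1$ and $h > (2\beta + 1)^{-1}$, then:

$$ \frac{n}{\sqrt{\sum_{k=1}^{n} h_k^{-1}}} \left( \bar{Z}_n - m(x) \right) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\left( 0, \Gamma^{-1} \Sigma \Gamma^{-1} \right), $$

where

$$ \Sigma = p(x) \int K^2(u)\, du \ \mathbb{E}\left[ \frac{(Y - m(x)) \otimes (Y - m(x))}{\|Y - m(x)\|^2} \,\middle|\, X = x \right], \tag{17} $$

$$ \Gamma = p(x)\, \mathbb{E}\left[ \frac{1}{\|Y - m(x)\|} \left( I_H - \frac{(Y - m(x)) \otimes (Y - m(x))}{\|Y - m(x)\|^2} \right) \,\middle|\, X = x \right]. \tag{18} $$
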



As shown in Cardot et al. (2011) in the unconditional framework, the operator $\Gamma$ has
a bounded inverse under assumption A1, so that the asymptotic variance operator is well
defined. Let us also remark that with our assumptions on the sequence of bandwidths, we
have

$$ \sum_{k=1}^{n} \frac{1}{h_k} = \frac{1}{1 + h}\, \frac{n}{h_n} + o\left( n h_n^{-1} \right). \tag{19} $$

Consequently, the rate of convergence in the CLT is of order $\sqrt{n h_n}$, which is the usual rate
of convergence in distribution for nonparametric regression, provided that the bias term
is negligible compared to the variance. This latter condition is ensured by the additional
condition $h > (2\beta + 1)^{-1}$ and we have, with Theorem 2.3,

$$ \sqrt{n h_n} \left( \bar{Z}_n - m(x) \right) \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\left( 0, \frac{1}{1 + h}\, \Gamma^{-1} \Sigma \Gamma^{-1} \right). $$

As in the real regression case (see Mokkadem et al. (2009)), it turns out that the averaged
estimator has a smaller asymptotic variance, by a factor $(1 + h)^{-1}$ in our case, than the
classical kernel estimator which minimizes the empirical version of risk (7).
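
The link between the sum of inverse bandwidths and the rate $\sqrt{n h_n}$ can be checked numerically. The sketch below assumes power bandwidths $h_k = c_h k^{-h}$ and compares the exact sum of the $1/h_k$ with the leading term $n / ((1 + h) h_n)$; the function name is hypothetical.

```python
import numpy as np


def bandwidth_sum_ratio(n, h, c_h=1.0):
    """Ratio of sum_{k<=n} 1/h_k to n / ((1 + h) * h_n), for h_k = c_h * k**(-h)."""
    k = np.arange(1, n + 1, dtype=float)
    exact = np.sum(k ** h / c_h)          # 1 / h_k = k**h / c_h
    h_n = c_h * float(n) ** (-h)
    leading = n / ((1.0 + h) * h_n)
    return exact / leading
```

The ratio tends to 1 as $n$ grows, so that normalizing by $n$ over the square root of this sum is asymptotically the same as normalizing by $\sqrt{(1 + h)\, n h_n}$, which is where the factor $(1 + h)^{-1}$ in the variance comes from.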

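Remark 3. Proceeding exactly as in the proof of Theorem 2.3, it is possible to establish a CLT for
another weighted version of the algorithm, $\tilde{Z}_n = \frac{1}{n} \sum_{k=1}^{n} \sqrt{h_k}\, (Z_k - m)$, which is the
empirical mean of $\sqrt{h_n}\, (Z_n - m)$. Under the same assumptions as Theorem 2.3 one has:

$$ \sqrt{n}\, \tilde{Z}_n \xrightarrow{\ \mathcal{L}\ } \mathcal{N}\left( 0, \Gamma^{-1} \Sigma \Gamma^{-1} \right). $$
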

3 Examples 

We first consider a simple simulated example in order to compare the performances of the 
averaged algorithm with the more classic static one as well as the recursive Robbins-Monro 
estimator without averaging. Then, the ability of our recursive averaged estimator to deal 
with large samples of very high-dimensional data is illustrated on the robust estimation of 
television audience profiles, measured at a minute scale over a period of 24 hours, given the 
total time spent watching television. All functions are coded in R (R Development Core
Team (2010)) and are available on request from the authors.

In this picture we represent the possible choices for the parameters $h$ and $\gamma$ when $\beta$ varies. On
the left is the most regular case where $\beta = 1$; on the right we set $\beta = 1/2$. In both cases, if
$(\gamma, h)$ lies in the lighter region, Theorem 2.1 holds and the algorithm converges. In the middle
region, the algorithm converges and the additional convergence estimate of Proposition 2.2 holds.
Finally, if $(\gamma, h)$ is in the darker region, the CLT of Theorem 2.3 holds. All these regions get
smaller when $\beta$ is small. Note that even in the most regular case $\beta = 1$, in order to fulfill the
hypotheses of Theorem 2.3 it is necessary to choose $\gamma$ larger than 2/3 and $h$ larger than 1/3.

Figure 1: Possible choices for $h$ and $\gamma$.
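
The regions of the figure can be explored with a small helper. This is only a sketch under stated assumptions: the function name is hypothetical, and the inequalities encode our reading of the conditions for steps $n^{-\gamma}$ and bandwidths $n^{-h}$, namely non-strict bounds $\gamma \leq 1$ and $h(1 + 2\beta) \geq \gamma$ for the first two results and the strict bounds $\gamma < 1$ and $h > (2\beta + 1)^{-1}$ for the CLT. Exact rationals are used in the usage example because a boundary case such as $\gamma = 9/10$, $h = 3/10$, $\beta = 1$ is sensitive to floating-point rounding.

```python
from fractions import Fraction


def applicable_results(gamma, h, beta):
    """Tell which results apply for exponents (gamma, h) and regularity beta."""
    base = gamma <= 1 and 2 * gamma - h > 1 and gamma + beta * h > 1
    return {
        "theorem_2_1": base,
        "proposition_2_2": base and h * (1 + 2 * beta) >= gamma,
        "theorem_2_3": base and gamma < 1 and h * (2 * beta + 1) > 1,
    }


# Exponents used later in the simulation study (beta = 1).
simulation_choice = applicable_results(Fraction(9, 10), Fraction(3, 10), 1)
```

With these conventions the simulation choice lies in the middle region: the convergence and rate results apply, but the bandwidth exponent is too small for the CLT.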



3.1 A simulated example 

Consider a Brownian motion $Y$ measured at $d$ equispaced time points in the interval $[0, 1]$,
so that we have $Y = (Y(t_1), \ldots, Y(t_d))$. Besides, suppose that we know the mean value
$X = \int_0^1 Y(t)\, dt$ of each trajectory $Y$. We can look for the conditional (geometric) median of
the vector $Y$ given $X$. The joint distribution of $(Y, X)$ is clearly Gaussian with $\mathbb{E} Y = 0$, $\mathbb{E} X = 0$,

$$ \mathrm{Cov}(Y(t_j), Y(t_\ell)) = \min(t_j, t_\ell), \qquad \mathrm{Var}(X) = \frac{1}{3} \qquad \text{and} \qquad \mathrm{Cov}(X, Y(t_j)) = t_j \left( 1 - \frac{t_j}{2} \right). $$

Consequently, the distribution of $Y$ given $X = x$ is Gaussian with conditional expectation,
for $j = 1, \ldots, d$,

$$ \mathbb{E}\left[ Y(t_j) \mid X = x \right] = \frac{3}{2}\, t_j (2 - t_j)\, x, $$

and a covariance matrix that does not depend on $x$. By symmetry of the Gaussian
distribution, it is also clear that the conditional expectation is equal to the conditional geometric
median, when $H = \mathbb{R}^d$ is equipped with the usual Euclidean norm, so that

$$ m(t_j, x) = \frac{3}{2}\, t_j (2 - t_j)\, x. \tag{20} $$

The hypotheses on the density $p$ are clearly satisfied since $X$ is a Gaussian random variable.
Furthermore, the Wasserstein distance between two Gaussian laws with expectations $m_1$
and $m_2$ and the same covariance matrix is simply $\|m_1 - m_2\|$ (see e.g. Givens and Shortt
(1984)), so that we can deduce, with (20), that $\beta = 1$ in Assumption A4.
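
For concreteness, the constant in A4 can be written out. The following lines are a sketch that only combines the equal-covariance formula for Gaussian laws with the fact that the conditional mean is linear in $x$, with slope $\mathrm{Cov}(X, Y(t_j)) / \mathrm{Var}(X) = 3 t_j (1 - t_j/2)$:

$$ W_2(\mu_x, \mu_{x'}) = \left\| \mathbb{E}[Y \mid X = x] - \mathbb{E}[Y \mid X = x'] \right\| = |x - x'| \left( \sum_{j=1}^{d} \left( 3 t_j \left( 1 - \frac{t_j}{2} \right) \right)^2 \right)^{1/2}. $$

Hence A4 holds with $\beta = 1$ and with $C_4$ equal to the Euclidean norm of the slope vector.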
We draw $n$ i.i.d. copies of $(Y, X)$ and we focus in this simulation study on the geometric
median of $Y$ given $x = 0.39$, which corresponds to the value of the third quartile of $X$. Note
that our conclusions remain unchanged for other non-extreme values of $X$.

We first compute the static estimator, named "static kernel" in the following. It is based
on a direct minimization, with Weiszfeld's algorithm (see Vardi and Zhang (2000) and
Möttönen et al. (2010)), of

$$ \sum_{i=1}^{n} w_i \|Y_i - \alpha\|, \tag{21} $$

where $w_i = \left[ \sum_{\ell=1}^{n} K\left( h^{-1} (X_\ell - x) \right) \right]^{-1} K\left( h^{-1} (X_i - x) \right)$ and $K$ is the Gaussian kernel.
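
A weighted Weiszfeld iteration for this criterion can be sketched as follows. This is the plain fixed-point iteration, not the modified algorithm of Vardi and Zhang; the function names, the starting point (the weighted mean), the stopping rule and the small constant `eps` guarding against division by zero are assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard Gaussian density."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def static_kernel_median(Y, X, x, h, n_iter=200, tol=1e-8, eps=1e-12):
    """Minimise sum_i w_i * ||Y_i - alpha|| by a weighted Weiszfeld iteration."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    w = gaussian_kernel((X - x) / h)
    w = w / w.sum()                      # normalised kernel weights
    alpha = w @ Y                        # assumed starting point: weighted mean
    for _ in range(n_iter):
        # Distances to the current iterate, floored to avoid dividing by zero.
        dist = np.maximum(np.linalg.norm(Y - alpha, axis=1), eps)
        coef = w / dist
        new_alpha = coef @ Y / coef.sum()
        if np.linalg.norm(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

Every iteration goes through the whole sample, which is why this static approach requires all the data to be stored, in contrast with the recursive estimators.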

The Robbins-Monro estimator $Z_n$, defined in (6), and the averaged estimator $\bar{Z}_n$,
defined in (15), are run for 10 starting points chosen randomly in the sample. Among the 10
estimations, we retain the one with the smallest empirical risk (21).

The accuracy of the different estimators is compared, for different values of the
bandwidth $h$ and sample sizes $n$, with the quadratic criterion

$$ R(\hat{m}) = \frac{1}{d} \sum_{j=1}^{d} \left( \hat{m}(t_j) - m(t_j) \right)^2. \tag{22} $$

Since $\beta = 1$, we can choose $\gamma = 9/10$ and $h = 3/10$, $c_h = 1$, so that the quadratic estimation
error for the Robbins-Monro algorithm will be, up to the $\ln(n)$ factor, of order $n^{-6/10}$ (see
Proposition 2.2).

Note that, for simplicity of comparison with the static kernel estimator, we also consider
fixed values for $h_n \in \{0.05, 0.10, 0.15, 0.20, 0.25\}$ and take in this case $\gamma = 2/3$. We are
aware that the assumptions needed for the asymptotic convergence are not satisfied but the
sample size is fixed in advance here.

We first present in Table 1 the mean value, over 500 replications, of the MSE defined in
(22), when estimating the conditional median with a sample size of $n = 500$ in dimension
$d = 100$. For comparison and interpretability of the results, note that $100 R(0) = 18.4$.
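
The benchmark value can be recovered from the closed form of the conditional median, $\frac{3}{2} t_j (2 - t_j) x$, which follows from the covariances given earlier. The sketch below assumes the grid $t_j = j/d$ (the exact grid is not specified in the text) and hypothetical function names; it evaluates the quadratic criterion for the trivial estimator that is identically zero.

```python
import numpy as np


def true_conditional_median(t, x):
    """Closed-form conditional median of the discretised Brownian motion given X = x."""
    return 1.5 * t * (2.0 - t) * x


def quadratic_error(m_hat, m_true):
    """Mean squared difference between an estimated and a reference profile."""
    return float(np.mean((np.asarray(m_hat) - np.asarray(m_true)) ** 2))


d = 100
t = np.arange(1, d + 1) / d                   # assumed grid t_j = j / d
m = true_conditional_median(t, 0.39)          # x = 0.39, third quartile of X
baseline = 100 * quadratic_error(np.zeros(d), m)
```

With this grid, `baseline` is close to the value 18.4 quoted in the text, which gives a scale for reading the errors reported in the tables.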

We note that, when the sample size is moderate (i.e. $n = 500$), the interest of considering
the averaged recursive estimation procedure is less evident than in the unconditional case
(see Cardot et al. (2010)) since the Robbins-Monro estimator $Z_n$ defined in (6) can perform,
for well chosen values of the tuning parameters $c_\gamma$ and $h_n$, nearly as well as the static
estimator. Nevertheless, we can remark that $Z_n$ is highly sensitive to the values of the tuning
parameters and its performances deteriorate markedly with small variations of these
parameters, as seen in Table 1. This is not the case of the averaged estimator $\bar{Z}_n$, defined in (15),
which is much less sensitive and thus allows less sharp choices of the values of the tuning
parameters provided the descent steps do not force the algorithm to converge too rapidly.
We note again (see Cardot et al. (2010)) that for too small values of $c_\gamma$ (i.e. $c_\gamma = 0.1$), the
algorithm converges too quickly and averaging leads to estimations that are outperformed
by the direct Robbins-Monro approach. A way to deal with this drawback is to perform
averaging only after a certain number of iterations. All these remarks are clearly illustrated
in Figure 2, which presents the estimation error, defined in (22), for both algorithms and for
different values of $c_\gamma$.

When the sample size gets larger the interest of the averaging step becomes clearer since
the estimation error of the Robbins-Monro estimator is always larger as soon as $c_\gamma \geq 1$ (see
Table 2). Furthermore, the estimation errors of the static kernel estimator and the averaged
recursive one are also now very close to each other.
recursive one are also now very close to each other. 
Table 1: Mean estimation errors (x 100) of the different estimators, for n = 500, d = 100,
and descent parameter $\gamma = 2/3$ when $h_n$ has a constant value and $\gamma = 0.9$ when $h_n = n^{-h}$
with $h = \gamma/3 = 0.3$.

Bandwidth $h_n$         0.05    0.10    0.15    0.20    0.25    $n^{-0.3}$
Static kernel          0.349   0.179   0.148   0.172   0.245   -
Robbins-Monro
  $c_\gamma = 0.1$     0.689   0.625   0.659   0.769   0.912   2.458
  $c_\gamma = 0.3$     0.370   0.194   0.159   0.178   0.253   0.332
  $c_\gamma = 1$       0.590   0.297   0.229   0.240   0.297   0.183
  $c_\gamma = 3$       1.177   0.647   0.486   0.425   0.453   0.248
Averaged
  $c_\gamma = 0.1$     1.047   1.000   1.051   1.160   1.336   2.995
  $c_\gamma = 0.3$     0.406   0.213   0.178   0.202   0.287   0.534
  $c_\gamma = 1$       0.402   0.195   0.160   0.182   0.252   0.192
  $c_\gamma = 3$       0.443   0.209   0.163   0.252   0.256   0.170


Table 2: Mean estimation errors (x 100) of the different estimators, for n = 2000, d = 100,
and descent parameter $\gamma = 2/3$ when $h_n$ has a constant value and $\gamma = 0.9$ when $h_n = n^{-h}$
with $h = \gamma/3 = 0.3$.

Bandwidth $h_n$         0.05    0.10    0.15    0.20    0.25    $n^{-0.3}$
Static kernel          0.082   0.053   0.060   0.099   0.176   -
Robbins-Monro
  $c_\gamma = 0.1$     0.139   0.128   0.149   0.205   0.324   1.321
  $c_\gamma = 0.3$     0.095   0.061   0.065   0.103   0.181   0.083
  $c_\gamma = 1$       0.173   0.104   0.098   0.126   0.194   0.061
  $c_\gamma = 3$       0.403   0.230   0.175   0.192   0.253   0.096
Averaged
  $c_\gamma = 0.1$     0.240   0.237   0.270   0.332   0.484   1.712
  $c_\gamma = 0.3$     0.091   0.058   0.065   0.102   0.183   0.138
  $c_\gamma = 1$       0.090   0.057   0.063   0.101   0.178   0.060
  $c_\gamma = 3$       0.097   0.058   0.064   0.101   0.180   0.057

Figure 2: Comparison of the two recursive algorithms according to the mean square error
of estimation for different values of $c_\gamma$ (with a logarithmic scale). The sample size is n = 500
and d = 100.

3.2 Television audience data

We have a sample of $n = 5422$ individual audiences measured every minute over a period
of 24 hours by the Médiamétrie company in France. For $j = 1, \ldots, 1440$, an observation
$Y_i(t_j)$ represents the proportion of time spent by the individual $i$ watching television during
the $j$th minute of this day. Thus, each vector $Y_i$ belongs to $[0, 1]^{1440}$. Note that in fact the first
measurement $t_1$ is made at 3 AM of day $d$ and the last one just before 3 AM of day $d + 1$ (see
Figure 3). A more detailed description of these data can be found in Cardot et al. (2011).

We are interested in estimating television consumption behaviors, over a 24-hour
period, according to the total time spent watching television. The covariate $X_i$ is the
proportion of time spent watching television over the considered period, $X_i = \left( \sum_{j=1}^{1440} Y_i(t_j) \right) / 1440$,
for $i = 1, \ldots, n = 5422$. We consider the quantile values of $X$ which are, in the sample,
$q_{25} = 0.0599$, $q_{50} = 0.128$, $q_{75} = 0.225$ and $q_{90} = 0.348$. This means for example, that the
ten percent of consumers with the highest consumption levels spend more than 34.8 % of
their time watching television whereas the 25 % of consumers with the lowest consumption
levels spend less than 6 % of their time watching television.

We have drawn in Figure 3 the estimated conditional median profiles with a bandwidth
value set to $h_n = 0.05$ and a descent parameter $c_\gamma = 0.5$, for $x \in \{q_{25}, q_{50}, q_{75}, q_{90}\}$. For
comparison and better interpretation, we have also plotted the overall geometric median
as well as the mean profile. One can note that the shape of the conditional profiles strongly
depends on the value of the covariate and that multiplicative models that could be thought
to be natural (see the simulation study) are in fact not adapted for modeling the conditional
audience median profiles. This is clear if we compare, for example, the levels of the
conditional median curves for $x = q_{75}$ and $x = q_{90}$ at time 15 and at time 21. Around 21, their
values are approximately the same and are close to the global maximum whereas at time 15
the value of the conditional median for $x = q_{90}$ is about twice the value of the conditional
median for $x = q_{75}$.

From a computational speed point of view, for one starting point, our algorithm, which
takes less than two seconds, is about 70 times faster than the static estimator, which requires
140 seconds to converge.

Figure 3: Estimation of the conditional median profile for different levels of total time spent
watching television, on the 6th September 2010.

4 Proofs

Notation. In all the proofs, $x$ will be a fixed point in $\mathbb{R}$ satisfying $p(x) > 0$. Since $x$ will not
vary, we will abuse notation and drop it from various quantities. In particular, in the following
$m$ will denote the median $m(x)$ of the conditional law $\mu_x$, and we will write $Z_n = Z_n(x)$ and
$\Phi(\alpha) = \Phi(x, \alpha)$.

4.1 About the assumptions

We begin with a simple geometric result on unit vectors. For $a, b$ two points in $H$, let $D(a, b)$
be the unit vector "starting" from $a$ in the direction of $b$. Now if $a, b, c$ are three points in $H$
such that $\|a - b\| \leq \|a - c\|$, let $c'$ denote the point of the segment $[a, c]$ at distance $\|a - b\|$
from $a$. Thales' theorem shows that:

$$ \frac{\|D(a, b) - D(a, c)\|}{\|b - c'\|} = \frac{\|a + D(a, b) - a\|}{\|a - b\|} = \frac{1}{\|a - b\|}, $$

so

$$ \|D(a, b) - D(a, c)\| \leq \frac{\|b - c'\|}{\|a - b\|} \leq \frac{\|b - c\|}{\|a - b\|}. $$

In any case,

$$ \|D(a, b) - D(a, c)\| \leq \frac{\|b - c\|}{\min(\|a - b\|, \|a - c\|)}. \tag{23} $$

We will need a "decoupled" version of this inequality:

$$ \|D(a, b) - D(a, c)\| \leq \frac{\|b - c\|}{\|a - b\|} + \frac{\|b - c\|}{\|a - c\|}. \tag{24} $$
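
The two bounds on the difference of unit vectors are easy to probe numerically. The sketch below draws random triples of points in a finite-dimensional Euclidean space (an assumption made for the example; the function names are hypothetical as well) and records the largest observed excess of the left-hand side over the bound by $\|b - c\|$ divided by the smaller of the two distances, and of that bound over its decoupled version.

```python
import numpy as np


def unit_direction(a, b):
    """Unit vector starting from a in the direction of b."""
    return (b - a) / np.linalg.norm(b - a)


def largest_violations(n_trials=1000, dim=5, seed=0):
    """Largest observed values of (lhs - tight bound) and (tight - decoupled bound)."""
    rng = np.random.default_rng(seed)
    worst_tight, worst_decoupled = -np.inf, -np.inf
    for _ in range(n_trials):
        a, b, c = rng.normal(size=(3, dim))
        lhs = np.linalg.norm(unit_direction(a, b) - unit_direction(a, c))
        dab, dac, dbc = (np.linalg.norm(a - b), np.linalg.norm(a - c),
                         np.linalg.norm(b - c))
        tight = dbc / min(dab, dac)
        decoupled = dbc / dab + dbc / dac
        worst_tight = max(worst_tight, lhs - tight)
        worst_decoupled = max(worst_decoupled, tight - decoupled)
    return worst_tight, worst_decoupled
```

Both returned values should be non-positive up to rounding error, since the first bound follows from the geometric argument and the second from $1/\min(u, v) \leq 1/u + 1/v$ for positive $u, v$.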



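We can now prove that A4 and A6 imply A3. Let $x, x'$ be two real numbers in the support
of $p$. Recall that $\mu_x$ denotes the law $\mathcal{L}(Y \mid X = x)$. Let $Y$ and $Y'$ be two random variables with
respective laws $\mu_x$ and $\mu_{x'}$, such that their joint law $\pi$ achieves the Wasserstein distance. Let
us first show that:

$$ \forall \alpha \in H, \quad \left\| \mathbb{E}[D(\alpha, Y)] - \mathbb{E}[D(\alpha, Y')] \right\| \leq C |x - x'|^{\beta}. \tag{25} $$

Fix an $\alpha \in H$. We have:

$$ \left\| \mathbb{E}[D(\alpha, Y)] - \mathbb{E}[D(\alpha, Y')] \right\| \leq \left\| \mathbb{E}_\pi\left[ D(\alpha, Y) - D(\alpha, Y') \right] \right\| \leq \mathbb{E}_\pi\left[ \|D(\alpha, Y) - D(\alpha, Y')\| \right]. $$

Now we use the geometric bound (24) and Hölder's inequality:

$$ \left\| \mathbb{E}[D(\alpha, Y)] - \mathbb{E}[D(\alpha, Y')] \right\| \leq \mathbb{E}_\pi\left[ \frac{\|Y - Y'\|}{\|Y - \alpha\|} \right] + \mathbb{E}_\pi\left[ \frac{\|Y - Y'\|}{\|Y' - \alpha\|} \right] $$

$$ \leq \left( \sqrt{\mathbb{E}\left[ \frac{1}{\|Y - \alpha\|^2} \right]} + \sqrt{\mathbb{E}\left[ \frac{1}{\|Y' - \alpha\|^2} \right]} \right) \sqrt{\mathbb{E}_\pi\left[ \|Y - Y'\|^2 \right]}. $$

The first factor is bounded by $2\sqrt{C_6}$ thanks to A6. The second one is, by definition, the
Wasserstein distance, and is bounded by $C_4 |x - x'|^{\beta}$ thanks to A4, therefore (25) holds.

Since $p$ is $C^2$ with compact support, the product $\Phi(x, \alpha) = -p(x)\, \mathbb{E}[D(\alpha, Y) \mid X = x]$ is itself
uniformly $\beta$-Hölder continuous; in other words A3 holds.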

4.2 First properties 

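Recall that, for $z \in H$, $\Phi_h(z)$ is defined by (8) as the conditional expectation of the step, with
window size $h$. When $h$ goes to zero, this "expected step" converges.

Proposition 4.1. The expected step is bounded:

$$ \forall h > 0, \ \forall \alpha \in H, \quad \|\Phi_h(\alpha)\| \leq p_{\max}. \tag{26} $$

Moreover, under hypotheses A2, A3 and A5, there exists a constant $C$ such that:

$$ \|\Phi_h(\alpha) - \Phi(\alpha)\| \leq C h^{\beta}, \tag{27} $$

where $\Phi(x, \alpha)$ is defined by (9).

Proof. With our strong hypotheses this result is easy to prove. Indeed

$$ \Phi_h(\alpha) = \int \frac{1}{h} K\left( \frac{x' - x}{h} \right) \Phi(x', \alpha)\, dx', $$

so that by Jensen's inequality

$$ \|\Phi_h(\alpha)\| \leq p_{\max}. $$

Moreover,

$$ \|\Phi_h(\alpha) - \Phi(\alpha)\| \leq \int \frac{1}{h} K\left( \frac{x' - x}{h} \right) \|\Phi(x', \alpha) - \Phi(x, \alpha)\|\, dx'. $$

Now we use Assumption A3 to bound the norm by $C_3 |x' - x|^{\beta}$, the compact support of the
bounded function $K$ (Assumption A5) and we integrate:

$$ \|\Phi_h(\alpha) - \Phi(\alpha)\| \leq C_3 \int \frac{1}{h} K\left( \frac{x' - x}{h} \right) |x' - x|^{\beta}\, dx' \leq C_3 \int K(t)\, h^{\beta} |t|^{\beta}\, dt \leq C h^{\beta}. $$

□

Thanks to this result, we have a natural decomposition of algorithm (6). Let us introduce
the two following quantities:

$$ D_h(z) = \Phi_h(z) - \Phi(z), \tag{28} $$

$$ \xi_{n+1} = -\frac{Y_{n+1} - Z_n}{\|Y_{n+1} - Z_n\|}\, \frac{1}{h_n} K\left( \frac{X_{n+1} - x}{h_n} \right) - \Phi_{h_n}(Z_n). \tag{29} $$

In terms of these quantities, we can rewrite (6) as:

$$ Z_{n+1} = Z_n - \gamma_n \Phi(Z_n) - \gamma_n D_{h_n}(Z_n) - \gamma_n \xi_{n+1}. \tag{30} $$

The first term $D_{h_n}(Z_n)$ will be controlled by Proposition 4.1. The second term $\xi_{n+1}$ defines
a sequence of martingale differences, since the conditional expectation given the sequence
of $\sigma$-algebras $\mathcal{F}_n = \sigma(Z_1, \ldots, Z_n) = \sigma(Y_1, X_1, \ldots, Y_n, X_n)$ satisfies

$$ \mathbb{E}\left[ \xi_{n+1} \mid \mathcal{F}_n \right] = 0, \quad \text{a.s.} $$

For future reference, let us note the following bound on $\xi_n$:

$$ \mathbb{E}\left[ \|\xi_{n+1}\|^2 \mid \mathcal{F}_n \right] = \mathbb{E}\left[ \frac{1}{h_n^2} K^2\left( \frac{X_{n+1} - x}{h_n} \right) \right] - \|\Phi_{h_n}(Z_n)\|^2 \leq \frac{C}{h_n} - \|\Phi_{h_n}(Z_n)\|^2 \leq \frac{C}{h_n}, \quad \text{a.s.} \tag{31} $$
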
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4.3 Almost sure convergence 



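In this section we prove Theorem 2.1. Define $V_n = \|Z_n - m\|^2$. By (30), we have:

$$\begin{aligned}
V_{n+1} = {} & V_n + \gamma_n^2 \|\Phi(Z_n)\|^2 + \gamma_n^2 \|D_{h_n}(Z_n) + \xi_{n+1}\|^2 \\
& - 2 \langle Z_n - m, \gamma_n \Phi(Z_n) \rangle - 2\gamma_n \langle Z_n - m, D_{h_n}(Z_n) + \xi_{n+1} \rangle \\
& + 2\gamma_n^2 \langle \Phi(Z_n), D_{h_n}(Z_n) + \xi_{n+1} \rangle.
\end{aligned}$$

The first scalar product term is non-positive; we denote it by $(-\eta_n)$. We condition on $\mathcal{F}_n$: the $\xi_{n+1}$ in the scalar products disappear by the martingale property. Then we use Hölder's inequality, together with $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$:

$$\begin{aligned}
\mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le {} & V_n + \gamma_n^2 \|\Phi(Z_n)\|^2 + 2\gamma_n^2 \|D_{h_n}(Z_n)\|^2 + 2\gamma_n^2\, \mathbb{E}\big[\|\xi_{n+1}\|^2 \mid \mathcal{F}_n\big] \\
& - \eta_n + 2\gamma_n \|Z_n - m\| \, \|D_{h_n}(Z_n)\| + 2\gamma_n^2 \|\Phi(Z_n)\| \, \|D_{h_n}(Z_n)\|.
\end{aligned}$$

On the last term we use $2xy \le x^2 + y^2$ to get:

$$\begin{aligned}
\mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le {} & V_n + 2\gamma_n^2 \|\Phi(Z_n)\|^2 + 3\gamma_n^2 \|D_{h_n}(Z_n)\|^2 + 2\gamma_n^2\, \mathbb{E}\big[\|\xi_{n+1}\|^2 \mid \mathcal{F}_n\big] \\
& - \eta_n + 2\gamma_n \|Z_n - m\| \, \|D_{h_n}(Z_n)\|.
\end{aligned}$$

On the last term we bound $\|Z_n - m\|$ by $(1 + V_n)$ to get:

$$\begin{aligned}
\mathbb{E}[V_{n+1} \mid \mathcal{F}_n] \le {} & \big(1 + 2\gamma_n \|D_{h_n}(Z_n)\|\big) V_n + 2\gamma_n^2 \|\Phi(Z_n)\|^2 + 3\gamma_n^2 \|D_{h_n}(Z_n)\|^2 \\
& + 2\gamma_n^2\, \mathbb{E}\big[\|\xi_{n+1}\|^2 \mid \mathcal{F}_n\big] + 2\gamma_n \|D_{h_n}(Z_n)\| - \eta_n.
\end{aligned}$$

Finally we bound $\|D_{h_n}(Z_n)\|$ by $C h_n^{\beta}$ thanks to (27), $\mathbb{E}[\|\xi_{n+1}\|^2 \mid \mathcal{F}_n]$ by $C/h_n$ thanks to (31), and $\|\Phi(Z_n)\|$ by $p(x)$. This yields:

$$\begin{aligned}
\mathbb{E}[V_{n+1} \mid \mathcal{F}_n] & \le \big(1 + 2C\gamma_n h_n^{\beta}\big) V_n + 2p(x)^2\gamma_n^2 + 3C^2\gamma_n^2 h_n^{2\beta} + 2C\frac{\gamma_n^2}{h_n} + 2C\gamma_n h_n^{\beta} - \eta_n \\
& \le (1 + b_n) V_n + \chi_n - \eta_n,
\end{aligned}$$

where $b_n = 2C\gamma_n h_n^{\beta}$ and $\chi_n = 2p(x)^2\gamma_n^2 + 3C^2\gamma_n^2 h_n^{2\beta} + 2C\gamma_n^2 h_n^{-1} + 2C\gamma_n h_n^{\beta}$ satisfy:

$$\sum_n b_n < \infty, \qquad \sum_n \chi_n < \infty.$$

Therefore by the Robbins-Siegmund Lemma (Theorem 1.3.12 of Duflo (1997)), $V_n$ converges almost surely and $\sum_n \eta_n < \infty$. This implies that the limit of $V_n$ is zero, by the same argument as in Cardot et al. (2011), assuming that $\sum_n \gamma_n = \infty$.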
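To make the almost sure convergence concrete, the recursion analysed above can be simulated. The following is only a minimal sketch under stated assumptions: the update rule $Z_{n+1} = Z_n + \gamma_n h_n^{-1} K((X_{n+1}-x)/h_n)\,(Y_{n+1}-Z_n)/\|Y_{n+1}-Z_n\|$ is inferred from the definitions of $\xi_{n+1}$ and $\Phi_{h_n}$ used in this section, and the Gaussian kernel, the toy data model and the function name `recursive_conditional_median` are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def recursive_conditional_median(xs, ys, x, c_gamma=1.0, gamma=0.75, c_h=1.0, h=0.2):
    """One pass of the kernel-weighted stochastic gradient recursion (hypothetical helper)."""
    z = np.zeros(ys.shape[1])
    for n, (x_new, y_new) in enumerate(zip(xs, ys), start=1):
        gamma_n = c_gamma * n ** (-gamma)   # descent step gamma_n = c_gamma * n^(-gamma)
        h_n = c_h * n ** (-h)               # bandwidth h_n = c_h * n^(-h)
        diff = y_new - z
        norm = np.linalg.norm(diff)
        if norm > 0.0:
            weight = gaussian_kernel((x_new - x) / h_n) / h_n
            z = z + gamma_n * weight * diff / norm
    return z

def centre(t):
    """Toy conditional centre: Y | X = t is N((t, 2t), I), so its median is (t, 2t)."""
    return np.stack([t, 2.0 * t], axis=-1)

rng = np.random.default_rng(0)
n_obs = 50_000
xs = rng.uniform(-1.0, 1.0, size=n_obs)
ys = centre(xs) + rng.standard_normal((n_obs, 2))

z_hat = recursive_conditional_median(xs, ys, x=0.5)
err = np.linalg.norm(z_hat - centre(0.5))
print("estimate:", z_hat, "error:", err)
```

In this toy model the conditional law is spherically symmetric around $(x, 2x)$, so the target median is known and the printed error gives a rough idea of how close a single pass gets.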



4.4 Proof of proposition 2.2 



For the sake of clarity, we follow the same steps as the proof of Proposition 3.2 in Cardot et al. (2011), and emphasize the necessary changes.

Step 1 — a spectral decomposition. This step is exactly the same as in Cardot et al. (2011): thanks to a spectral decomposition of $\Gamma$, we can define the operators:

$$\alpha_k = I_H - \gamma_k \Gamma, \qquad \beta_n = \alpha_n \alpha_{n-1} \cdots \alpha_1.$$

Introducing the sequence of real functions, for $n \in \mathbb{N}$,

$$f_n(x) = \prod_{k=1}^{n} (1 - \gamma_k x),$$

we see that each operator $\beta_n$ can also be expressed as follows:

$$\beta_n x = \sum_{\lambda \in \Lambda} f_n(\lambda) \langle e_\lambda, x \rangle e_\lambda, \qquad x \in H;$$

their inverses are bounded operators, and satisfy $\beta_n^{-1} x = \sum_{\lambda \in \Lambda} f_n^{-1}(\lambda) \langle e_\lambda, x \rangle e_\lambda$. Moreover there exist constants $\kappa_1$, $\kappa_2$, $\kappa_3$ such that:

$$\forall x \in \sigma(\Gamma), \quad \kappa_1 \exp(-s_n x) \le f_n(x) \le \kappa_2 \exp(-s_n x), \qquad \Big| s_n - \frac{c_\gamma}{1-\gamma}\, n^{1-\gamma} \Big| \le \kappa_3, \tag{32}$$

where we recall that $s_n = \sum_{k=1}^{n} \gamma_k$ and $\gamma_k = c_\gamma k^{-\gamma}$.
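The two bounds in (32) can be checked numerically for a given step sequence. The sketch below is only an illustration under assumptions: it takes $c_\gamma = 1$, $\gamma = 0.75$ and a single spectral value $x = 0.3$ (all arbitrary choices), and the helper name `log_f_and_s` is hypothetical.

```python
import numpy as np

def log_f_and_s(n, x, c_gamma=1.0, gamma=0.75):
    """Return log f_n(x) = sum_k log(1 - gamma_k x) and s_n = sum_k gamma_k."""
    k = np.arange(1, n + 1, dtype=float)
    gamma_k = c_gamma * k ** (-gamma)
    return np.sum(np.log1p(-gamma_k * x)), np.sum(gamma_k)

x = 0.3                      # assumed spectral value with gamma_k * x < 1 for all k
ratios, gaps = [], []
for n in (10, 100, 1_000, 10_000, 100_000):
    log_f, s_n = log_f_and_s(n, x)
    ratios.append(np.exp(log_f + s_n * x))   # f_n(x) / exp(-s_n x)
    gaps.append(s_n - n ** 0.25 / 0.25)      # s_n - c_gamma n^(1-gamma) / (1-gamma)
    print(n, ratios[-1], gaps[-1])
```

The ratio $f_n(x)/\exp(-s_n x)$ stays between two positive constants and the gap $s_n - c_\gamma n^{1-\gamma}/(1-\gamma)$ stays bounded, which is what (32) asserts for this particular choice of parameters.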



Step 2 — Decomposition of the algorithm. Recall the decomposition (30), and rewrite the algorithm as follows:

$$\begin{aligned}
Z_{n+1} & = Z_n - \gamma_n \xi_{n+1} - \gamma_n \Phi(Z_n) - \gamma_n D_{h_n}(Z_n) \\
& = Z_n - \gamma_n \xi_{n+1} - \gamma_n \big(\Gamma(Z_n - m) + \delta_n\big) - \gamma_n D_{h_n}(Z_n),
\end{aligned} \tag{33}$$

where $\delta_n = \Phi(Z_n) - \Gamma(Z_n - m)$ is the difference between the gradient of $G$ and the gradient of its quadratic approximation. Compared to Cardot et al. (2011), there are two differences: the martingale difference has changed, and there is an additional term $\gamma_n D_{h_n}(Z_n)$. Therefore:

$$\forall k, \quad Z_{k+1} - m = \alpha_k (Z_k - m) - \gamma_k \xi_{k+1} - \gamma_k \delta_k - \gamma_k D_{h_k}(Z_k). \tag{34}$$

Rewriting $\alpha_{n-1} \alpha_{n-2} \cdots \alpha_{k+1}$ as $\beta_{n-1} \beta_k^{-1}$, we get by induction,

$$Z_n - m = \beta_{n-1}(Z_1 - m) + \beta_{n-1} M_n - \beta_{n-1} R_{n-1} - \beta_{n-1} R'_{n-1}, \tag{35}$$

where

$$R_n = \sum_{k=1}^{n} \gamma_k \beta_k^{-1} \delta_k, \qquad M_n = - \sum_{k=1}^{n-1} \gamma_k \beta_k^{-1} \xi_{k+1}, \qquad R'_n = \sum_{k=1}^{n} \gamma_k \beta_k^{-1} D_{h_k}(Z_k).$$

At this point, the first and third terms are the same as in Cardot et al. (2011), the martingale has changed and there is an additional remainder term $R'_n$.
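The induction behind this decomposition can be sanity-checked in the scalar case, where $\Gamma$ reduces to a single eigenvalue. The sketch below is an illustration only: the eigenvalue `lam`, the step exponent and the random perturbations `u` (standing in for the sum $\xi_{k+1} + \delta_k + D_{h_k}(Z_k)$) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 200, 0.4                       # horizon and assumed eigenvalue of Gamma
k = np.arange(1, n, dtype=float)        # indices k = 1, ..., n-1
gamma = k ** (-0.75)                    # steps gamma_k
u = rng.standard_normal(n - 1)          # stands for xi_{k+1} + delta_k + D_{h_k}(Z_k)
alpha = 1.0 - gamma * lam               # alpha_k = 1 - gamma_k * lam

# Direct recursion: T_{k+1} = alpha_k T_k - gamma_k u_k, starting from T_1.
t1 = 1.5
t = t1
for a, g, uk in zip(alpha, gamma, u):
    t = a * t - g * uk

# Unrolled form: T_n = beta_{n-1} T_1 - beta_{n-1} * sum_k gamma_k beta_k^{-1} u_k.
beta = np.cumprod(alpha)                # beta_k = alpha_k * ... * alpha_1
closed = beta[-1] * t1 - beta[-1] * np.sum(gamma * u / beta)

print(t, closed)
```

Both computations agree up to floating-point error, which is the scalar counterpart of rewriting $\alpha_{n-1} \cdots \alpha_{k+1}$ as $\beta_{n-1}\beta_k^{-1}$.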
Step 3 — The deterministic term. Just as in Cardot et al. (2011), we get:

$$\mathbb{E}\big[\|\beta_{n-1}(Z_1 - m)\|^2\big] \le C \exp\big(-2 n^{1-\gamma} \lambda_{\min}\big)\, \mathbb{E}\big[\|Z_1 - m\|^2\big]. \tag{36}$$



Step 4 — The martingale. Still following Cardot et al. (2011), we use the spectral decomposition to deal with the martingale part. The changes appear just before eq. (41) in that paper, where the bound on $\mathbb{E}[\|\xi_n\|^2]$ has to be changed (from $1$ to $C/h_n$, using the new bound (31)). Then we use the bounds (32) to get:

$$\begin{aligned}
\mathbb{E}\big[\|\beta_{n-1} M_n\|^2\big] & \le C \sum_{k \le n-1} \frac{\gamma_k^2}{h_k} \left( \frac{f_{n-1}(\lambda_{\min})}{f_k(\lambda_{\min})} \right)^2 \\
& \le C \sum_{k \le n-1} \frac{\gamma_k^2}{h_k} \exp\left( - \frac{2 \lambda_{\min} c_\gamma}{1-\gamma} \big( n^{1-\gamma} - k^{1-\gamma} \big) \right).
\end{aligned} \tag{37}$$

Once more, the first terms in the sum are negligible (thanks to the exponential), and we isolate the last terms, for $k \ge l(n)$, where $l(n)$ is given by

$$l(n)^{1-\gamma} = n^{1-\gamma} - c_\alpha \ln(n), \tag{38}$$

for some constant $c_\alpha$. Choosing $c_\alpha$ large enough, the arguments from Cardot et al. (2011) ensure that the main contribution comes from the last terms. The number of terms, that is $n - l(n)$, is of the order $\ln(n)\, n^{\gamma}$, and $\gamma_{l(n)}^2 / h_{l(n)}$ is equivalent to $c\, n^{h - 2\gamma}$. Therefore

$$\mathbb{E}\big[\|\beta_{n-1} M_n\|^2\big] \le C\, \frac{\ln(n)}{n^{\gamma - h}}. \tag{39}$$



Step 5 — the error terms.

The first error term is $\beta_{n-1} R_{n-1} = \beta_{n-1} \sum_{k=1}^{n-1} \gamma_k \beta_k^{-1} \delta_k$, where $\delta_k = \Phi(Z_k) - \Gamma(Z_k - m)$. This one can be treated exactly as in Cardot et al. (2011). We recall the definition of the event $\Omega_N$:



N 



Vn > N,Vk > n - l(n), \\Z k (cv) - m\\ < 1 / K 
co, and ||^(<^)|| < C r ||Z^(o;) — m\\ 2 

V*,||4H|| <N. 



for a value of $K$ to be chosen later, and $l(n)$ defined by (38). Then, for any power of $n$ (say $n^{-42}$) there is a $C$ such that, on $\Omega_N$ and for $n \ge N$,

$$\|\beta_{n-1} R_{n-1}\|^2 \le \frac{C N^2}{n^{42}} + \frac{C}{K^2} \sum_{k=l(n)+1}^{n-1} \gamma_k \|Z_k - m\|^2. \tag{40}$$



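We now turn to the bound of the new error term $\beta_{n-1} R'_{n-1} = \beta_{n-1} \sum_{k=1}^{n-1} \gamma_k \beta_k^{-1} D_{h_k}(Z_k)$. To bound $D_h$, we use (27):

$$\|D_h(z)\| \le C h^{\beta}.$$

Therefore for $N$ large enough, and for $k \ge l(n)$,

$$\|\mathbb{1}_{\Omega_N} D_{h_k}(Z_k)\| \le C h_k^{\beta}.$$

For $k$ smaller than $l(n)$, we use a crude bound on $\|D_{h_k}(Z_k)\|$ by a constant depending only on $p_{\max}$. Finally we get:

$$\|\mathbb{1}_{\Omega_N} \beta_{n-1} R'_{n-1}\| \le \frac{C}{n^{42}} + C \big(n - l(n)\big)\, \gamma_{l(n)}\, h_{l(n)}^{\beta}.$$

The last term is bounded by $C \ln(n)\, n^{-h\beta}$ and dominates the first term. Finally, since by assumption $h(1 + 2\beta) > \gamma$, one gets

$$\|\mathbb{1}_{\Omega_N} \beta_{n-1} R'_{n-1}\|^2 \le \frac{C}{n^{\gamma - h}}. \tag{41}$$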



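Now we use (36), (39), (40) and (41) to bound the four terms that appear in (35). We get, for $n \ge N$ and some new constant $C$:

$$\mathbb{E}\big[\mathbb{1}_{\Omega_N} \|Z_n - m\|^2\big] \le \frac{C \ln(n)}{n^{\gamma - h}} + \frac{C}{K^2} \sum_{l(n) < k < n} \gamma_k\, \mathbb{E}\big[\mathbb{1}_{\Omega_N} \|Z_k - m\|^2\big].$$

By the same induction as in Cardot et al. (2011), we obtain the bound announced in Proposition 2.2.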



4.5 Proof of Theorem 2.3



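The following proof follows the same guidelines as the proof of Theorem 3.4 in Cardot et al. (2011). Again we emphasize the necessary changes due to the introduction of the kernel and of the conditional distribution. We first linearize the target function around the conditional median $m$ as in (33):

$$\forall n, \quad Z_{n+1} - m = (I_H - \gamma_n \Gamma)(Z_n - m) - \gamma_n \xi_{n+1} - \gamma_n \delta_n - \gamma_n D_{h_n}(Z_n),$$

where $(\xi_n)$ is a martingale difference sequence. Therefore, for all $k$,

$$\Gamma(Z_k - m) = \gamma_k^{-1} \big( (Z_k - m) - (Z_{k+1} - m) \big) - \xi_{k+1} - \delta_k - D_{h_k}(Z_k). \tag{42}$$

Define now

$$T_n := Z_n - m, \qquad \bar{T}_n := \bar{Z}_n - m \qquad \text{and} \qquad M_{n+1} := \sum_{k=1}^{n} \xi_{k+1},$$

and sum (42) over $k$:

$$n \Gamma \bar{T}_n = \sum_{k=1}^{n} \frac{1}{\gamma_k} (T_k - T_{k+1}) - \sum_{k=1}^{n} \big( \delta_k + D_{h_k}(Z_k) \big) - M_{n+1},$$

so that

$$\frac{n}{\sqrt{\sum_{k=1}^{n} h_k^{-1}}}\, \Gamma \bar{T}_n = \frac{1}{\sqrt{\sum_{k=1}^{n} h_k^{-1}}} \left( \frac{T_1}{\gamma_1} - A_n + A'_n - A''_n \right) - \frac{1}{\sqrt{\sum_{k=1}^{n} h_k^{-1}}}\, M_{n+1}, \tag{43}$$

where

$$A_n := \frac{T_{n+1}}{\gamma_n}, \qquad A'_n := \sum_{k=2}^{n} T_k \left( \frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}} \right), \qquad A''_n := \sum_{k=1}^{n} \big( \delta_k + D_{h_k}(Z_k) \big).$$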
Step Zero — convergence of covariance operators. Our first task is to establish a central limit theorem for the last term of (43):

$$\frac{1}{\sqrt{\sum_{k=1}^{n} h_k^{-1}}}\, M_{n+1} \xrightarrow[n \to \infty]{\mathcal{L}} \mathcal{N}(0, \Sigma), \tag{44}$$

where $\Sigma$ is the limiting covariance defined by (17). On the space of linear operators on $H$ we consider two classical norms, the (strong) operator norm and the Hilbert-Schmidt norm:

$$\|A\|_{op} = \sup\{ \|Ay\|_H ;\ \|y\| \le 1 \}, \qquad \|A\|_{H.S.} = \Big( \sum_{j \ge 0} \|A e_j\|^2 \Big)^{1/2},$$

where $(e_j)$ is an orthonormal basis of $H$. The following lemma will be useful.
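Lemma 4.2. Define a random covariance operator $\Sigma_n$ by:

$$\Sigma_n = h_n\, \mathbb{E}\big[ \xi_{n+1} \otimes \xi_{n+1} \mid \mathcal{F}_n \big]. \tag{45}$$

Then:

$$\sqrt{\Sigma_n} \xrightarrow[n \to \infty]{H.\text{-}S.} \sqrt{\Sigma}, \quad \text{a.s.} \tag{46}$$

In particular, $\Sigma_n$ converges to $\Sigma$ a.s. in operator norm. Moreover, if $\bar{\Sigma}_n$ denotes the following averaged version of $\Sigma_n$:

$$\bar{\Sigma}_n = \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \frac{1}{h_k} \Sigma_k = \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \mathbb{E}\big[ \xi_{k+1} \otimes \xi_{k+1} \mid \mathcal{F}_k \big],$$

then

$$\sqrt{\bar{\Sigma}_n} \xrightarrow[n \to \infty]{H.\text{-}S.} \sqrt{\Sigma}, \quad \text{a.s.} \tag{47}$$

Finally, for any orthogonal projection operator $P$,

$$\mathbb{E}\big[ \operatorname{Tr}(\bar{\Sigma}_n P) \big] \xrightarrow[n \to \infty]{} \mathbb{E}\big[ \operatorname{Tr}(\Sigma P) \big]. \tag{48}$$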



Remark 4. Let us note that the convergence of square roots of covariance operators is equivalent to the convergence of the centered Gaussian laws with these covariances; see e.g. Bogachev (1998), Example 3.8.13.

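Proof. We first show that the convergence (46) holds in operator norm. Recall that $D(x, y)$ denotes the unit vector $(y - x)/\|y - x\|$. Let us rewrite $\Sigma_n$:

$$\Sigma_n = \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X_{n+1} - x}{h_n} \Big) D(Z_n, Y_{n+1}) \otimes D(Z_n, Y_{n+1}) \,\Big|\, \mathcal{F}_n \right] - h_n\, \Phi_{h_n}(Z_n) \otimes \Phi_{h_n}(Z_n).$$

Denote by $(X, Y)$ a couple of random variables with the original joint law, and let $Y_x$ be a random variable with law $\mu_x$, independent of $(X, Y)$.

We decompose the difference $\Sigma_n - \Sigma = D_1 + D_2 + D_3 + D_4$ where

$$\begin{aligned}
D_1 & = \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X_{n+1} - x}{h_n} \Big) D(Z_n, Y_{n+1}) \otimes D(Z_n, Y_{n+1}) \,\Big|\, \mathcal{F}_n \right] - \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) D(m, Y) \otimes D(m, Y) \right], \\
D_2 & = \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) D(m, Y) \otimes D(m, Y) \right] - \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) D(m, Y_x) \otimes D(m, Y_x) \right], \\
D_3 & = \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) D(m, Y_x) \otimes D(m, Y_x) \right] - \Sigma, \\
D_4 & = - h_n\, \Phi_{h_n}(Z_n) \otimes \Phi_{h_n}(Z_n).
\end{aligned}$$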
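Note that only the first and the last terms are random; the others depend on $n$ only through the quantity $h_n$. For $(a, b, c) \in H^3$, it is easy to see that:

$$\begin{aligned}
\|D(a,b) \otimes D(a,b) - D(a,c) \otimes D(a,c)\|_{op} & \le \big( \|D(a,b)\| + \|D(a,c)\| \big)\, \|D(a,b) - D(a,c)\| \\
& \le 2 \left( \frac{1}{\|a - b\|} + \frac{1}{\|a - c\|} \right) \|b - c\|,
\end{aligned}$$

where we used (24) in the last line. Therefore:

$$\|D_1\|_{op} \le \frac{2}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \left( \frac{1}{\|Y - m\|} + \frac{1}{\|Y - Z_n\|} \right) \Big|\, \mathcal{F}_n \right] \|Z_n - m\|.$$

Conditioning on $X$ and using Assumption A6, we get:

$$\|D_1\|_{op} \le \frac{C}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \right] \|Z_n - m\|.$$

The boundedness of $p$ and the finiteness of $\nu_2 = \int K^2(u)\, du$ ensure

$$\frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \right] \xrightarrow[n \to \infty]{} p(x)\, \nu_2, \tag{49}$$

by dominated convergence; therefore the sequence $(h_n)^{-1}\, \mathbb{E}[K^2((X - x)/h_n)]$ is bounded. Since $Z_n$ converges a.s. to $m$, $\|D_1\|$ converges a.s. to zero.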
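The limit (49) can be illustrated numerically. The sketch below is an assumption-laden example rather than part of the proof: it takes a Gaussian kernel $K$, a standard Gaussian design density $p$ and the point $x = 0.5$, and uses the change of variables $h^{-1}\mathbb{E}[K^2((X - x)/h)] = \int K^2(u)\, p(x + hu)\, du$ evaluated by a Riemann sum.

```python
import numpy as np

def K(u):
    """Gaussian kernel (an illustrative choice)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def p(t):
    """Design density of X, here standard Gaussian (an illustrative choice)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

u = np.linspace(-10.0, 10.0, 20001)
du = u[1] - u[0]
nu2 = np.sum(K(u) ** 2) * du            # nu_2 = integral of K^2

x = 0.5
vals = {}
for h in (1.0, 0.3, 0.1, 0.03, 0.01):
    # (1/h) E[K^2((X - x)/h)] = integral of K^2(u) p(x + h u) du
    vals[h] = np.sum(K(u) ** 2 * p(x + h * u)) * du
    print(h, vals[h], p(x) * nu2)
```

As the bandwidth decreases, the computed values approach $p(x)\,\nu_2$, and they remain bounded over the whole range of bandwidths, which is the boundedness used to control $D_1$.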
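The second term $D_2$ is treated similarly; we get:

$$\|D_2\|_{op} \le \frac{2}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \left( \frac{1}{\|Y - m\|} + \frac{1}{\|Y_x - m\|} \right) \|Y - Y_x\| \right].$$

Recall that $\mu_x = \mathcal{L}(Y \mid X = x)$ and let $\mu_{X,x}$ be a coupling of $\mu_X$ and $\mu_x$ that achieves the Wasserstein distance. We condition on the value of $X$ and we apply Hölder's inequality in order to bound the first integral with Assumption A6 and the second one with Assumption A4:

$$\begin{aligned}
\|D_2\|_{op} & \le \frac{2}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \int \left( \frac{1}{\|y - m\|} + \frac{1}{\|y' - m\|} \right) \|y - y'\|\, d\mu_{X,x}(y, y') \right] \\
& \le \frac{2}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \left( \int \left( \frac{1}{\|y - m\|} + \frac{1}{\|y' - m\|} \right)^2 d\mu_{X,x}(y, y') \right)^{1/2} \left( \int \|y - y'\|^2\, d\mu_{X,x}(y, y') \right)^{1/2} \right] \\
& \le \frac{C}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) W_2(\mu_X, \mu_x) \right] \\
& \le \frac{C}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) |X - x|^{\beta} \right] \\
& \le C h_n^{\beta} \int K^2(u)\, |u|^{\beta}\, du = O(h_n^{\beta}) \xrightarrow[n \to \infty]{} 0.
\end{aligned}$$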

In the third term $D_3$, since $Y_x$ is independent of $X$ we may write

$$D_3 = \left( \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \right] - p(x)\, \nu_2 \right) \mathbb{E}\big[ D(m, Y_x) \otimes D(m, Y_x) \big].$$

Thanks to (49), this converges to zero. Finally by Proposition 4.1, $\Phi_{h_n}(Z_n)$ is almost surely bounded, and since $\|a \otimes b\| \le \|a\| \, \|b\|$,

$$\| h_n\, \Phi_{h_n}(Z_n) \otimes \Phi_{h_n}(Z_n) \| \le h_n \|\Phi_{h_n}(Z_n)\|^2 \xrightarrow[n \to \infty]{} 0, \quad \text{a.s.}$$

Therefore, $\Sigma_n$ converges to $\Sigma$ in the operator norm.
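The elementary inequality on rank-one operators used for $D_1$ and $D_2$ can be checked on random points. The following sketch works in $\mathbb{R}^5$ instead of a general Hilbert space, and the helper name `unit` is a hypothetical stand-in for $D(\cdot,\cdot)$; it records the worst observed ratio between the left-hand side and the final bound.

```python
import numpy as np

def unit(a, b):
    """D(a, b): unit vector pointing from a to b."""
    d = b - a
    return d / np.linalg.norm(d)

rng = np.random.default_rng(2)
worst = 0.0
for _ in range(2000):
    a, b, c = rng.standard_normal((3, 5))
    dab, dac = unit(a, b), unit(a, c)
    # Operator (spectral) norm of the difference of the two rank-one operators.
    lhs = np.linalg.norm(np.outer(dab, dab) - np.outer(dac, dac), ord=2)
    rhs = 2.0 * (1.0 / np.linalg.norm(a - b) + 1.0 / np.linalg.norm(a - c)) * np.linalg.norm(b - c)
    worst = max(worst, lhs / rhs)

print("largest observed ratio lhs / rhs:", worst)
```

The observed ratio stays below one, as the inequality predicts; the finite-dimensional check is of course no substitute for the argument in $H$.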

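To prove the convergence of $\bar{\Sigma}_n$ in operator norm, observe that

$$\| \bar{\Sigma}_n - \Sigma \|_{op} \le \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \frac{1}{h_k} \| \Sigma_k - \Sigma \|_{op}.$$

Since $\|\Sigma_k - \Sigma\|$ converges to zero, the conclusion follows by the Toeplitz lemma.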
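The Toeplitz lemma step can be visualised with the weights $1/h_k$ used here. In the sketch below, the bandwidth exponent $h = 0.2$ and the sequence $a_k = k^{-1/2}$ (standing in for $\|\Sigma_k - \Sigma\|_{op}$) are arbitrary assumptions chosen only to show the mechanism.

```python
import numpy as np

h = 0.2                                   # assumed bandwidth exponent, h_k = k^(-h)
k = np.arange(1, 10**5 + 1, dtype=float)
w = k ** h                                # weights 1 / h_k (diverging sum)
a = 1.0 / np.sqrt(k)                      # a sequence converging to zero

# Weighted averages (sum_k w_k a_k) / (sum_k w_k) for n = 1, ..., 10^5.
avg = np.cumsum(w * a) / np.cumsum(w)

for n in (10, 1000, 100000):
    print(n, avg[n - 1])
```

Because the weights are positive with a diverging sum, the weighted averages inherit the limit of the sequence, here zero.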

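Let us show that these convergences hold in the Hilbert-Schmidt norm. For any $a$, $\operatorname{Tr}(a \otimes a) = \|a\|^2$. Therefore:

$$\big\| \sqrt{\Sigma_n} \big\|_{H.S.}^2 = \operatorname{Tr}(\Sigma_n) = \frac{1}{h_n}\, \mathbb{E}\left[ K^2\Big( \frac{X - x}{h_n} \Big) \right] - h_n \|\Phi_{h_n}(Z_n)\|^2 \xrightarrow[n \to \infty]{} p(x)\, \nu_2 = \operatorname{Tr}(\Sigma) = \big\| \sqrt{\Sigma} \big\|_{H.S.}^2.$$

Another application of the Toeplitz lemma shows that

$$\big\| \sqrt{\bar{\Sigma}_n} \big\|_{H.S.}^2 = \operatorname{Tr}(\bar{\Sigma}_n) = \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \frac{1}{h_k} \operatorname{Tr}(\Sigma_k) \xrightarrow[n \to \infty]{} \operatorname{Tr}(\Sigma) = \big\| \sqrt{\Sigma} \big\|_{H.S.}^2.$$

By the same reasoning as in Example 3.8.15 of Bogachev (1998), this implies the H.-S. convergences (46) and (47).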

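Finally, let $P$ be an orthogonal projection operator. Choose a basis $(e_i)_{i \in \mathbb{N}}$ of orthonormal eigenvectors of $P$: $P e_i = 0$ or $P e_i = e_i$. Since $\bar{\Sigma}_n$ is trace-class, so is $\bar{\Sigma}_n P$ and:

$$\begin{aligned}
\operatorname{Tr}(\bar{\Sigma}_n P) & = \sum_i \langle e_i, \bar{\Sigma}_n P e_i \rangle = \sum_i \langle P e_i, \bar{\Sigma}_n P e_i \rangle \\
& = \sum_i \Big\langle \sqrt{\bar{\Sigma}_n} P e_i, \sqrt{\bar{\Sigma}_n} P e_i \Big\rangle = \sum_i \Big\| \sqrt{\bar{\Sigma}_n} P e_i \Big\|^2 \\
& = \Big\| \sqrt{\bar{\Sigma}_n} P \Big\|_{H.S.}^2 \xrightarrow[n \to \infty]{} \Big\| \sqrt{\Sigma} P \Big\|_{H.S.}^2 = \operatorname{Tr}(\Sigma P).
\end{aligned}$$

This convergence is almost sure. Since $\operatorname{Tr}(\Sigma_k) \le h_k^{-1}\, \mathbb{E}[K^2((X - x)/h_k)] \le C$, the convergence also holds in $L^1$ by dominated convergence, and (48) holds. □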



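Step 1 — The CLT for the martingale. To prove the CLT (44), let us check that the assumptions of Theorem 5.1 in Jakubowski (1988) are fulfilled. Recalling (19), translated in our context, these assumptions are:

$$\forall \eta > 0, \quad \lim_{n \to \infty} \mathbb{P}\left( \sup_{1 \le k \le n} \sqrt{\frac{h_n}{n}}\, \|\xi_{k+1}\| > \eta \right) = 0, \tag{50}$$

$$\lim_{n \to \infty} \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \langle \xi_{k+1}, e_i \rangle \langle \xi_{k+1}, e_j \rangle = \psi_{ij}, \quad \text{a.s.,} \tag{51}$$

$$\forall \varepsilon > 0, \quad \lim_{N \to \infty} \limsup_{n \to \infty} \mathbb{P}\left( \frac{h_n}{n} \sum_{k=1}^{n} \sum_{j=N}^{\infty} \langle \xi_{k+1}, e_j \rangle^2 > \varepsilon \right) = 0, \tag{52}$$

where $(e_n)_{n \in \mathbb{N}}$ is an orthonormal basis of $H$ and $\psi_{ij} := \langle \Sigma e_i, e_j \rangle$.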

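We deal with condition (50) by applying Markov's inequality. Let $\eta > 0$.

$$\mathbb{P}\left( \sup_{1 \le k \le n} \sqrt{\frac{h_n}{n}}\, \|\xi_{k+1}\| > \eta \right) \le \sum_{k=1}^{n} \mathbb{P}\left( \sqrt{\frac{h_n}{n}}\, \|\xi_{k+1}\| > \eta \right) \le \frac{1}{\eta^p} \left( \frac{h_n}{n} \right)^{p/2} \sum_{k=1}^{n} \mathbb{E}\big[ \|\xi_{k+1}\|^p \big],$$

for any $p > 1$. We choose an integer $p$ such that $p > 2$. By convexity of the function $x \mapsto x^p$ we have, for any $n$,

$$\|\xi_{n+1}\|^p \le 2^{p-1} \left[ \frac{1}{h_n^p} K^p\Big( \frac{X_{n+1} - x}{h_n} \Big) + \|\Phi_{h_n}(Z_n)\|^p \right].$$

Thus an easy computation yields

$$\mathbb{E}\big[ \|\xi_{n+1}\|^p \big] \le \frac{2^{p-1}\, p_{\max} \int_{\mathbb{R}} K^p(z)\, dz}{h_n^{p-1}} + 2^{p-1}\, \mathbb{E}\big[ \|\Phi_{h_n}(Z_n)\|^p \big].$$

In the last term, $\Phi_{h_n}(Z_n)$ is bounded, thanks to (26). Consequently, there exists a constant $C(p)$ (independent of $n$) such that

$$\mathbb{E}\big[ \|\xi_{n+1}\|^p \big] \le \frac{C(p)}{h_n^{p-1}}.$$

Hence we have, for a constant $C'(p)$ independent of $n$,

$$\mathbb{P}\left( \sup_{1 \le k \le n} \sqrt{\frac{h_n}{n}}\, \|\xi_{k+1}\| > \eta \right) \le \frac{C(p)\, h_n^{p/2}}{\eta^p\, n^{p/2}} \sum_{k=1}^{n} h_k^{-p+1} \le \frac{C'(p)}{n^{p/2 - h(p/2 - 1) - 1}}.$$

Since $p > 2$, one has $p/2 - h(p/2 - 1) - 1 > 0$ and thus (50) holds.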

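Condition (51) is a consequence of the law of large numbers for martingales. Let us consider $(e_n)_{n \in \mathbb{N}}$ an orthonormal basis of $H$. From the decomposition

$$\langle \xi_{n+1}, e_i \rangle \langle \xi_{n+1}, e_j \rangle = \mathbb{E}\big[ \langle \xi_{n+1}, e_i \rangle \langle \xi_{n+1}, e_j \rangle \mid \mathcal{F}_n \big] + \epsilon_{n+1},$$

with $\epsilon_{n+1} := \langle \xi_{n+1}, e_i \rangle \langle \xi_{n+1}, e_j \rangle - \mathbb{E}[ \langle \xi_{n+1}, e_i \rangle \langle \xi_{n+1}, e_j \rangle \mid \mathcal{F}_n ]$, we have

$$\begin{aligned}
\frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \langle \xi_{k+1}, e_i \rangle \langle \xi_{k+1}, e_j \rangle & = \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \mathbb{E}\big[ \langle \xi_{k+1}, e_i \rangle \langle \xi_{k+1}, e_j \rangle \mid \mathcal{F}_k \big] + \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \epsilon_{k+1} \\
& = \langle e_i, \bar{\Sigma}_n e_j \rangle + \frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \epsilon_{k+1}.
\end{aligned}$$

By Lemma 4.2 the matrix element $\langle e_i, \bar{\Sigma}_n e_j \rangle$ converges to $\psi_{ij}$. The law of large numbers for the martingale $\big( \sum_{k=1}^{n} \epsilon_{k+1} \big)_{n \in \mathbb{N}}$, whose increasing process is of order $n^{1+3h}$, yields

$$\frac{1}{\sum_{k=1}^{n} h_k^{-1}} \sum_{k=1}^{n} \epsilon_{k+1} \xrightarrow[n \to \infty]{} 0, \quad \text{a.s.,}$$

since $h < 1$, and condition (51) is satisfied.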
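It remains to check condition (52). Let $\varepsilon > 0$. Applying Markov's inequality, we have

$$\begin{aligned}
\mathbb{P}\left( \frac{h_n}{n} \sum_{k=1}^{n} \sum_{j=N}^{\infty} \langle \xi_{k+1}, e_j \rangle^2 > \varepsilon \right) & \le \frac{h_n}{n \varepsilon} \sum_{k=1}^{n} \sum_{j=N}^{\infty} \mathbb{E}\big[ \langle \xi_{k+1}, e_j \rangle^2 \big] = \frac{h_n}{n \varepsilon} \sum_{k=1}^{n} \frac{1}{h_k} \sum_{j=N}^{\infty} \mathbb{E}\big[ \langle e_j, \Sigma_k e_j \rangle \big] \\
& = \frac{h_n}{n \varepsilon} \left( \sum_{k=1}^{n} \frac{1}{h_k} \right) \sum_{j=N}^{\infty} \mathbb{E}\big[ \langle e_j, \bar{\Sigma}_n e_j \rangle \big].
\end{aligned}$$

Call $P_N$ the orthogonal projection on the $e_i$, $i \ge N$. Then

$$\mathbb{P}\left( \frac{h_n}{n} \sum_{k=1}^{n} \sum_{j=N}^{\infty} \langle \xi_{k+1}, e_j \rangle^2 > \varepsilon \right) \le \frac{h_n}{n \varepsilon} \left( \sum_{k=1}^{n} \frac{1}{h_k} \right) \mathbb{E}\big[ \operatorname{Tr}(\bar{\Sigma}_n P_N) \big].$$

Therefore

$$\limsup_{n \to \infty} \mathbb{P}\left( \frac{h_n}{n} \sum_{k=1}^{n} \sum_{j=N}^{\infty} \langle \xi_{k+1}, e_j \rangle^2 > \varepsilon \right) \le \frac{1}{\varepsilon (1 + h)} \operatorname{Tr}(\Sigma P_N),$$

and (52) follows.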



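Step 2 — The remaining terms are negligible. Now, it remains to prove that all the other terms in (43) converge in probability to zero. Due to the equivalence (19), we have to prove the convergence in probability to zero of $\sqrt{h_n/n}\, A_n$, $\sqrt{h_n/n}\, A'_n$ and $\sqrt{h_n/n}\, A''_n$.

Recall that $\mathbb{E}\big[ \mathbb{1}_{\Omega_N} \|Z_n - m\|^2 \big] \le C_N \ln(n) / n^{\gamma - h}$, thanks to Proposition 2.2. For the first term $A_n = T_{n+1}/\gamma_n$ we have:

$$\mathbb{E}\left[ \mathbb{1}_{\Omega_N} \frac{h_n}{n} \|A_n\|^2 \right] \le \frac{C_N \ln(n)}{n^{1-\gamma}},$$

therefore $\sqrt{h_n/n}\, A_n \to 0$ in probability as $n \to \infty$. Let us turn to the second term $A'_n = \sum_{k=2}^{n} T_k \big( \frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}} \big)$. Since there exists a constant $C$ such that

$$\left| \frac{1}{\gamma_k} - \frac{1}{\gamma_{k-1}} \right| \le C k^{\gamma - 1},$$

by applying Jensen's inequality together with Proposition 2.2 there is a positive constant $C$ such that

$$\mathbb{E}\left[ \mathbb{1}_{\Omega_N} \sqrt{\frac{h_n}{n}}\, \|A'_n\| \right] \le C \sqrt{\ln(n)}\; n^{\gamma/2 - 1/2}.$$

Therefore $\sqrt{h_n/n}\, A'_n \to 0$ in probability as $n \to \infty$, since $\gamma < 1$.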
Finally, for the last term $A''_n = \sum_{k=1}^{n} \big( \delta_k + D_{h_k}(Z_k) \big)$, since on $\Omega_N$, $\|\delta_k\| \le C_\Gamma \|Z_k - m\|^2$, we have for the part in $\delta_k$:

$$\mathbb{E}\left[ \mathbb{1}_{\Omega_N} \sqrt{\frac{h_n}{n}}\, \Big\| \sum_{k=1}^{n} \delta_k \Big\| \right] \le C \ln(n)\, n^{h/2 - \gamma + 1/2}.$$

For the additional term, due to (27), we have $\mathbb{E}\big[ \mathbb{1}_{\Omega_N} \|D_{h_k}(Z_k)\| \big] \le C h_k^{\beta}$ so that for some positive constant $C$,

$$\mathbb{E}\left[ \mathbb{1}_{\Omega_N} \sqrt{\frac{h_n}{n}}\, \Big\| \sum_{k=1}^{n} D_{h_k}(Z_k) \Big\| \right] \le C n^{1/2 - h/2 - h\beta}.$$

The end of the proof follows the same guidelines as in Cardot et al. (2011).